spatialHeatmap 2.1.3
[ThG: I have edited/commented sections 1-4. Overall I think you should (1) provide a simple illustration in the introduction that summarizes all 3 methods (annot, manual, co-clustring). Figure 5 can then be removed since it is hard to follow and contains too much information relating to the supplement which we wanted to remove from this vignette. (2) Further please reduce the number of code chunks to make things easier for the user. For clarity you could use meta functions and summarize in the text what they are doing. Often you include the individual steps but then there is no explanation given what they are doning. (3) The optimization part needs to be entirely removed from the main part of this vignette as there is no context given what its relevance is. (4) Please also simplify the Supplement section 5 or remove it from the vignette and perhaps include it in a script file that is then just referenced for expert users in the text. Once this part 5 has been improved I will fine edit it. (5) In addition, please address my comments given in red font throughout the text.]
The primary utility of the spatialHeatmap package is the generation of spatial heatmaps (SHM) for visualizing cell-, tissue- and organ-specific abundance patterns of biological molecules in anatomical images. This is useful for identifying genes with spatially enriched (SE) expression patterns as well as clusters and/or network modules composed of genes sharing similar expression patterns. These functionalities are described in the main vignette of the spatialHeatmap package.
The following describes extended functionalities for integrating bulk tissue with single cell data by co-visualizing them in composite plots that combine spatial heatmaps with embedding plots of high-dimensional data. The main input data required for this co-visualization are quantitative assay data and anatomic images. The assay data can be provided in a variety of widely used tabular data structures (e.g. data.frame, SummarizedExperiment or SingleCellExperiment), while the anatomic images need to be supplied as annotated SVGs (aSVGs). The creation of aSVGs is described in the main vignette of this package. For the embedding plots of single cell data, several dimensionality reduction algorithms (e.g. PCA, UMAP or tSNE) are supported. To associate single cells with their source tissues, the user can choose among three major methods including annotation, manual and automatic (Figure 1). Like most functionalities in spatialHeatmap, these functionalities are available from within R as well as the corresponding Shiny app (Chang et al. 2021, Chang and Borges Ribeiro (2018)).
[ThG: would simplify and generalize this section to all three methods and move it to the introduction at the beginning. Most of what you are discribing here applies to all methods. By adjusting the illustration you can use the illustration for all 3 methods.]
The three methods for associating single cells with bulk tissues are illustrated in Figure 1. The annotation-based and manual methods are similar. The main difference is how the cell labels are generated. In the annotation-based method, cell annotation labels are usually present in an existing SingleCellExperiment that is created by a research group or the community such as the objects from the scRNAseq package (Risso and Cole 2022). While in the manual method, cell labels are created by custom criteria or methods such as clustering methods and stored in a tabular file. In both methods, expression profiles of every molecule (e.g. gene) under the same cell label are aggregated by taking mean or median. Then the aggregated labels are matched to aSVG spatial features (tissues, organs, etc.) according to a matching list.
By contrast, the automatic method assigns bulk tissues to single cells as labels automatically through co-clustering. The most common inputs are raw count data (e.g. RNA-seq) of chosen bulk tissues and single cells of the same organ (Figure 1.1), and the single cells come from the whole organ or at least cover the chosen bulk tissues. Bulk and single cell data are filtered to remove the most noisy data. The main difference between bulk and single cells is the sparsity in the latter. To reduce such difference, the bulk and single-cell data are combined and normalized together, and subsequently separated (Figure 1.2). Next, dimensionality reduction is performed on single-cell data using PCA or UMAP method, where one resultant dimensionality is equivalent to one molecule (e.g. gene). The top dimensionalities are clustered (Figure 1.4) to generate single cell clusters. Each cell cluster is refined by removing cells having low similarities with other cells in the same cluster (Figure 1.5). As a result, each cell cluster is more homogeneous (Figure 1.6). Next, bulk and single cell data are combined and co-clustered (Figure 1.7) to produce co-clusters.
There are three types of co-clusters: 1) Multiple bulk tissues are co-clustered with many cells. Spearman's correlation coefficient (similarity) is calculated for any bulk-cell pair in the same co-cluster. If a bulk has the largest similarity with a specific cell than any other bulk, this bulk is assigned to this cell as source bulk. For example, in Figure 1.8a bulk A is assigned to cell a1 because a1 has higher similarity with A than B; 2). Only one bulk tissue is co-clustered with many cells. This bulk is assigned to all the cells in the same co-cluster (1.8b); and 3) No bulk is present in a cluster. All these cells are not considered for downstream co-visualization (Figure 1.8c), which are potential novel cell populations. After the automatic bulk assignments to cells, cells with the same bulk label are aggregated in the same way as annotation-based or manual method and subsequently the aggregated assay profiles are mapped to aSVG features.
[ThG: ROCs can only be generated if the correct assingments are known. Readers will not understand why this is mentioned here and how they would use the tool since they don't have the true tissue assignments.] JZ: ROC is related to optimzation. The text is updated.
[ThG: also note the text in the above two paragraph needs major revisions with respect to grammar and clarity.]
Figure 1: Overview of co-visualization methods
In the annotation-based method, annotation labels are provided in existing SingleCellExperiment, while in the manual method cell labels are created by users, which is very flexible. In the automatic method, bulk tissues are automatically assigned to single cells as labels in a co-clustering process. In any of these methods, cells under the same label are aggregated by mean or median on assay profiles. Cells are colored by their aggregated expression profiles in embedding plots and aSVG spatial features matching with these cells are painted by the same colors.
The spatialHeatmap package can be installed with the BiocManager::install command.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("spatialHeatmap")
Next, the packages required for running the sample code in this vignette need to be loaded.
library(spatialHeatmap); library(SummarizedExperiment); library(scran); library(scater); library(igraph); library(SingleCellExperiment); library(BiocParallel)
The following lists the vignette(s) of this package in an HTML browser. Clicking the name of the corresponding vignette will open it.
browseVignettes('spatialHeatmap')
To reduce runtime, intermediate results can be cached under ~/.cache/shm.
cache.pa <- '~/.cache/shm' # Set path of the cache directory
Three main methods are developed for assigning cell types from single cell experiments with their source tissues, i.e. annotation-based, manual and automatic. Before detailed demonstration on these methods, a general toy example that summarizes the common features of the three methods is shown, which allows users to have a quick test.
To obtain reproducible results, a fixed seed is set for generating random numbers.
set.seed(10)
Create a toy single cell data by using mockSCE from the scuttle package (McCarthy et al. 2017). The toy data are normalized and log2-transformed (logNormCounts), and subsequently processed by dimensionality reduction methods of PCA, TSNE, and UMAP (reduce_dim).
sce.toy <- mockSCE() # Create toy data.
sce.norm.toy <- logNormCounts(sce.toy) # Normalization and log2-transformation.
sce.dimred.toy <- reduce_dim(sce.norm.toy) # Dimensionality reduction.
The downstream co-visualization is demenstrated on the label of cell cycles (Cell_Cycle), which includes G1, G2M, G0, S.
colData(sce.dimred.toy)[1:2, ]
## DataFrame with 2 rows and 4 columns
## Mutation_Status Cell_Cycle Treatment sizeFactor
## <character> <character> <character> <numeric>
## Cell_001 positive G1 treat2 1.029846
## Cell_002 positive G1 treat2 0.952431
unique(colData(sce.dimred.toy)$Cell_Cycle)
## [1] "G1" "S" "G0" "G2M"
As shown in (Figure 1), to map single cell data onto aSVG features, assay profiles under the same cell label should be aggregated by mean or median (aggr='mean'). The aggregation is performed on log2-transformed normlized data as specified by assay.na='logcounts'. After aggregation, all cells are collapsed into 4 cell cycle labels.
sce.aggr.toy <- aggr_rep(sce.dimred.toy, assay.na='logcounts', sam.factor='Cell_Cycle', con.factor=NULL, aggr='mean')
colData(sce.aggr.toy)
## DataFrame with 4 rows and 4 columns
## Mutation_Status Cell_Cycle Treatment sizeFactor
## <character> <character> <character> <numeric>
## G1 positive G1 treat2 1.029846
## S positive S treat1 0.963111
## G0 positive G0 treat2 1.000982
## G2M negative G2M treat1 0.992205
An example aSVG file of mouse brain is included in spatialHeatmap, and spatial features are extracted.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
feature.df <- return_feature(svg.path=svg.mus.brain)
tail(feature.df$feature) # Partial features are shown.
## [1] "brainstem" "midbrain"
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"
## [5] "nose" "corpora.quadrigemina"
One collapsed cell cycle is equivalent to one aSVG feature. Before co-visualization, their matching relationship needs to be defined. In annotation-based and manual methods, the matching relationship is defined by users in a named list, while in the automatic method it is automatically defined, which is the essential difference across the three methods. Note, one collapsed cell label can be matched to multiple aSVG features but not vice versa. The following is an example named list.
lis.match.toy <- list(G1=c('hypothalamus'), G0=c('brainstem', 'medulla.oblongata'))
The co-visualization is created on gene Gene_0010. The embedding plot includes all single cells before aggregation. The G1 and G0 cells are colored by their respecitive aggregated expression profiles and indicated in the legend at the bottom. In the nearby spatial heatmap plot, aSVG features are filled by the same color as the matching cells defined in the named list (lis.match.toy).
shm.lis.toy <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr.toy, ID=c('Gene_0010'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.dimred.toy, dimred='PCA', cell.group='Cell_Cycle', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match.toy, bar.width=0.11, dim.lgd.nrow=1)
Figure 2: Example co-visualization plot
The co-visualization is created on gene Gene_0010. Single cells in the embedding plot and their matching aSVG features in the spatial heatmap are filled by the same color according to aggregated assay profiles within each cell label.
The annotation method allows to provide known cell-to-tissue annotations. These cell labels are stored in the label column of the colData slot in an SCE object, which is usually an existing object created by the community.
[ThG: the Shiny part makes no sense in this annotation section since mouse actions would fall under manual mode. There must be an upload option in the app.]
The following example uses single cell data from oligodendrocytes of mouse brain (Marques et al. 2016). Prior to co-visualizing, the single cell data are pre-processed with standard QC and normalization methods. The details of these steps are described in the OSCA tutorial.
set.seed(10)
The single cell sample data set can be imported as follows.
sce.pa <- system.file("extdata/shinyApp/example", "sce_manual_mouse.rds", package="spatialHeatmap")
sce <- readRDS(sce.pa)
The quatity control (QC), normalization, and dimensionality reduction steps can each be performed with a single command (process_cell_meta).
[ThG: you need to at least briefly state what algorithms/method are used in each step.]
[Short version]
In QC, common per-cell metrics are cacluated such as library size, mitochodrial gene percentage, etc. Then problematic cells are filtered out according to these metrics. Refer to perCellQCMetrics and perCellQCFilters in the scuttle package for more details (McCarthy et al. 2017). In normalization, per-cell size factors are computed using a scaling normalization method followed by a deconvolution strategy. Single cells are finally normalized by these per-cell size factors. See more details in quickCluster, computeSumFactors from the scran package (Lun, McCarthy, and Marioni 2016), and logNormCounts from the scuttle package (McCarthy et al. 2017).
[Long version]
In the QC, frequently used per-cell metrics are calculated for identifying problematic cells, such as library size, number of detected features above a threshold, mitochodrial gene percentage, etc. Then these metrics are used to determine outlier cells based on median-absolute-deviation (MAD). Refer to perCellQCMetrics and perCellQCFilters in the scuttle package for more details (McCarthy et al. 2017). In the normalization, a quick-clustering method is applied to divide cells into clusters. Then a scaling normalization method is performed to obtain per-cluster size factors. Next, the size factor in each cluster is decomposed into per-cell size factors by a deconvolution strategy. Finally, all cells are normalized by per-cell size factors. See more details in quickCluster, computeSumFactors from the scran package (Lun, McCarthy, and Marioni 2016), and logNormCounts from the scuttle package (McCarthy et al. 2017).
In dimensionality reduction, the high-dimensional gene expression data are embedded into a 2-3 dimensional space using PCA, tSNE and UMAP. All three embedding result sets are stored in the SingleCellExperiment object. Details are seen in denoisePCA from scran (Lun, McCarthy, and Marioni 2016), and runUMAP, runTSNE from scater (McCarthy et al. 2017). Subsequently, the UMAP result is visualized as an example in form of a scatter plot where the dots are colored by the corresponding cell labels.
sce.dimred <- process_cell_meta(sce, qc.metric=list(subsets=list(Mt=rowData(sce)$featureType=='mito'), threshold=1))
plotUMAP(sce.dimred, colour_by="label")
Figure 3: Embedding plot single cells
The cells are colored by labels in the label column in colData.
The following command returns a slice of the cell annotation and experiment variable information from the label and expVar column respectively of the colData component of the SingleCellExperiment object, here sce.dimred.
colData(sce.dimred)[25:27, c('label', 'expVar')]
## DataFrame with 3 rows and 2 columns
## label expVar
## <character> <character>
## C1.1772125.078.A03 corpus.callosum control
## C1.1772096.168.C12 hypothalamus 6h.post.stress
## C1.1772117.102.H11 dorsal.horn control
[ThG: First, the previous is an incomplete sentence that should only be used in a title. Second, this section is confusing. What do you mean by 'aggregate'? Isn't this done already by having the annotations available? Aren't you using the existing annotations to compute a summary statistics for each predefined group of cells belonging to a tissue specified by the annotations? Subsequently, the summary values are used to define the colors in the spatial heatmap, correct? If so try to outline this more clearly. The same applies ot the help file of the aggr_rep function. The text in this help file is hard to follow.
Each annotation label in the label column of colData defines a cell group or cluster. The expression profiles of each gene are aggregated by mean (aggr='mean') within each cell group. If experiment variables are provided, here con.factor='expVar', the aggregation will be performed on the new variable as a combination of label and expVar. The aggregation is carried out on normalized data indicated by assay.na='logcounts'.
sce.aggr <- aggr_rep(sce.dimred, assay.na='logcounts', sam.factor='label', con.factor='expVar', aggr='mean')
A slice of aggregated data is shown. hypothalamus and control is a annotation label and experiment variable respectively. The abundance mean of gene Tcea1 under the combined variable hypothalamus__control is 1.440910.
as.matrix(logcounts(sce.aggr)[1:2, 1:2])
## hypothalamus__control SN.VTA__control
## Tcea1 1.440910 1.62474
## Atp6v1h 2.275958 1.95121
Next, the spatial features are extracted with the return_feature function from an aSVG image of mouse brain. This sample aSVG image file is provided by the package.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
feature.df <- return_feature(svg.path=svg.mus.brain)
tail(feature.df$feature)
## [1] "brainstem" "midbrain"
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"
## [5] "nose" "corpora.quadrigemina"
[ThG: Why is do you select here only one feature/tissue instead of all annotated features. Also the need/utility of this list is unclear.]
In practice, the identifiers of aSVG features and annotation label counterparts are not exactly the same, since they might come from different sources, such as corpus.striatum in aSVG and corpus.callosum in label. To coordinate the differences and give users flexibilities, a named list is required to match cell labels and aSVG features, where cell labels are in the name slots and aSVG features are list elements. The example list is created on three tissues that have same or similar countterparts between aSVG features and cell labels. Note, one cell label could be matched to multiple aSVG features, not vice versa.
lis.match <- list(hypothalamus=c('hypothalamus'), cortex.S1=c('cerebral.cortex'), corpus.callosum=c('corpus.striatum'))
[ThG: there needs to be a detailed explanation or legend explaining what is shown on the plot. Again coloring only one tissue creates the impression that tissues need to be depicted one by one instead of a single plot. I would change this example to highlight most tissues. Most genes are not so selectively expressed to show up in only one tissue, so choosing here an example where this is the case shouldn't be hard.
Co-visualization is created on aggregated expression profiles (sce.aggr). Here gene Cops5 is chosen as an example. The data includes two experiment variables control and 6h.post.stress, thus covisualization plots are created under each condition. The legend under embedding plots shows the cell labels present in the matching list (lis.match). The cell population of the same label under the same condition are colored by their aggregated expression profiles, and the bulk tissue counterpart is colored by the same color in the anatomical image. In the downloaded single cell data, only hypothalamus is present under 6h.post.stress, so only hypothalamus cells and bulk tissues are colored under 6h.post.stress. The tar.cell argument specifies which cell populations to show. It can be one or multiple cell labels, and matched indicates all cell labels in the matching list.
shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Cops5'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, sce.dimred=sce.dimred, dimred='PCA', cell.group='label', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.09, dim.lgd.nrow=2)
Figure 4: Co-visualizing single cells and bulk tissues by annotation cell labels
The expression profiles of gene Cops5 are used. Cell population of the same annotation label is colored by their aggregated expression profiles under the same condition, and the bulk tissue counterpart is colored by the same color.
In the manual method, cell group or cluster labels are manually created according to criteria or methods chosen by users such as clustering methods, which is very flexible than the annotation method. The resultant manual clusters need to be stored in a tabular file, then added to the SingleCellExperiment by the manual_cluster function. The same single cell data and aSVG file as in the annotation method are used to demonstrate the manual method. The steps of quality control, normalization, and dimensionality reduction are the same with the annotation method, while a subsequent step of adding manual clusters to the SingleCellExperiment is required.
[ThG: I don't understand why the manual method uses clustering??? If clustering needs to be used here then it is unclear what clustering algorithim is applied.]
An example manual cluster file is included in spatialHeatmap, where cluster labels are created by the clustering function cluster_cell. Refer to the help file for how to use this function. Two columns are required in this file. The cell column contains single cell identifiers present in the rownames of colData slot in the SingleCellExperiment, while the cluster column contains the manual cluster labels.
manual.clus.mus.sc.pa <- system.file("extdata/shinyApp/example", "manual_cluster_mouse_brain.txt", package="spatialHeatmap")
manual.clus.mus.sc <- read.table(manual.clus.mus.sc.pa, header=TRUE, sep='\t')
manual.clus.mus.sc[1:3, ]
## cell cluster
## 1 C1.1772078.029.F11 clus6
## 2 C1.1772089.202.E04 clus6
## 3 C1.1772099.091.D10 clus3
Manual cluster labels are exclusively included in the cluster column in the colData slot by the function manual_cluster, which does not interfere with the label column that stores annotation labels.
sce.clus <- manual_cluster(sce=sce.dimred, df.clus=manual.clus.mus.sc)
colData(sce.clus)[1:3, c('cluster', 'label')]
## DataFrame with 3 rows and 2 columns
## cluster label
## <character> <character>
## C1.1772078.029.F11 clus6 hypothalamus
## C1.1772089.202.E04 clus6 SN.VTA
## C1.1772099.091.D10 clus3 dorsal.horn
Embedding plot of single cells colored by manual clusters in the cluster column in colData.
plotUMAP(sce.clus, colour_by="cluster")
Figure 5: Embedding plot single cells
The cells are labeled by clusters in the cluster column in colData.
The same mouse brain aSVG as the annotation method is used.
tail(feature.df$feature)
## [1] "brainstem" "midbrain"
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"
## [5] "nose" "corpora.quadrigemina"
Similar with the annotation method, create a named list to match manual cell cluster clus1 with aSVG feature hypothalamus, and cluster clus3 with cerebral.cortex and midbrain. The latter demonstrates one cell cluster is matched to multiple aSVG features.
lis.match.clus <- list('clus1'=c('cerebral.cortex'), 'clus3'=c('hypothalamus', 'midbrain'))
[ThG: same problem as before the aggregate part is unclear.]
Aggregate cells by taking average (aggr='mean') of gene expression profiles within each manual cluster in the cluster column (sam.factor='cluster'). To illustrate aggregation only on cells, the experiment variable is not considered (con.factor=NULL). assay.na='logcounts' indicates the aggregation is based on normalized data.
sce.clus.aggr <- aggr_rep(sce.clus, assay.na='logcounts', sam.factor='cluster', con.factor=NULL, aggr='mean')
Co-visualization plots is built on aggretated data (sce.clus.aggr). Here gene Rpl7 is chosen. The manual clusters in the matching list (lis.match.clus) are colored by within-cluster aggregation of Rpl7 abundance profiles, and indicated by the legends under the embedding plots. The same color of each matched cluster is used to paint the bulk tissue counterpart in the anatomical image.
shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.clus.aggr, ID=c('Rpl7'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.clus, dimred='PCA', cell.group='cluster', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match.clus, bar.width=0.09, dim.lgd.nrow=1)
Figure 6: Co-visualizing single cells and bulk tissues by manual cell clusters
The expression profiles of gene Rpl7 are used. The matched manual clusters are colored by within-cluster aggregation of expression profiles and the same colors are used to color the bulk tissue counterparts in the anatomical image.
[ThG: I don't understand why this section is called 'Manual'?]
In addition to the annatation and manual methods, an automatic method is provided that matches cells and bulk tissues for co-visualization (Figure 1). The automatic process is carried out by combining and co-clustering bulk and single cell data. The most desirable bulk and single cell data would be derived from the same organ or tissue. [ThG: why is that a requirement? You need to better explain what the requirements are here.] If same organ, the chosen bulk tissues are major or complete tissues comprising this organ, while the single cell data are assayed on the region covering at least the chosen bulk tissues or on the whole organ. Similarly, if chosen bulk tissues are sub-tissues composing partial or the complete tissue, the single cells need to cover at least the chosen sub-tissues or the whole tissue. The key point is bulk tissues and single cells have anatomical overlap.
To obtain optimal default settings for the automatic method, this method is optimized and tested on real bulk and single cell datasets. The optimization utilities are included in spatialHeatmap, so users have the choice to optimize this method on their own real data. This section showcases application of the automatic method on mouse brain data with default settings obtained through optimization. The optimization on toy data and real data are explained in an external file [JZ: update in the optimzation section is in progress.] . The bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). Both bulk and single cell data sets are reduced for demonstration purpose.
To obtain reproducible results, a fixed seed is set for generating random numbers.
set.seed(10)
The example bulk and single cell count data are included in spatialHeatmap and imported. Replicates of the same bulk tissue should have the same identifiers. For example, all replicates of Cerebral Cortex have the label CERE.CORTEX instead of CERE.CORTEX1, CERE.CORTEX2, and so on. Since cells are aggregated based on uniuqe bulk tissue labels in downstream procedures.
# Example bulk data.
blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.txt", package="spatialHeatmap")
blk.mus <- as.matrix(read.table(blk.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
blk.mus[1:3, 1:5]
## CERE.CORTEX HIPP HYPOTHA CERE CERE.CORTEX
## AI593442 177 256 50 24 285
## Actr3b 513 1465 228 244 666
## Adcy1 701 1243 57 1910 836
# Example single cell data.
sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.txt", package="spatialHeatmap")
sc.mus <- as.matrix(read.table(sc.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
sc.mus[1:3, 1:5]
## isocort isocort isocort isocort olfa
## AI593442 2 2 2 5 0
## Actr3b 3 5 4 4 1
## Adcy1 3 6 6 6 0
Both bulk and single cell data are filtered to remove low quality values. In bulk data, genes are filtered with two filters. The first one is the proportion of counts over a threshold (A) in a gene is at least p (pOA), and the second is the coefficient of variance (CV) is within a specified range. Only genes passing these criteria are retained. Refer to filterfun in the geneflter package for more details (Gentleman et al. 2018). In single cell data, genes and cells are filtered according to proportion of counts over a minimum (min.cnt) is larger than a threshold (p.in.gen for genes and p.in.cell for cells). For instance, min.cnt=1 and p.in.gen=0.1 suggests if the proportion of counts over 1 is at least 0.1 in a gene, this gene is retained.
blk.mus <- filter_data(data=blk.mus, sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 5), CV=c(0.2, 100), verbose=FALSE)
mus.lis <- filter_cell(lis=list(sc.mus=sc.mus), bulk=blk.mus, gen.rm=NULL, min.cnt=1, p.in.cell=0.5, p.in.gen=0.1, verbose=FALSE)
The main different between bulk and single cell data is the sparsity in the latter. To reduce such difference, they are combined and normalized together, then separated.
mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor')
if (is.null(mus.lis.nor)) {
mus.lis.nor <- norm_multi(dat.lis=mus.lis, cpm=FALSE)
save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor)
}
The same aSVG file of mouse brain as the toy example is used. The process of extracting aSVG features is not shown.
tail(feature.df$feature) # Partial features are shown.
## [1] "brainstem" "midbrain"
## [3] "dorsal.plus.ventral.thalamus" "hypothalamus"
## [5] "nose" "corpora.quadrigemina"
The matching between bulk and single cells are automatic, while the matching between aSVG features and bulk tissues needs to be defined by users in a data.frame. Since there is no uniform naming scheme for aSVG features or bulk tissues. In most cases, identifiers of aSVG features and bulk tissues are user dependent, especially when they are from different research groups. The matching data.frame is utilized to coordinate differences of naming schemes between aSVG features and bulk tissues. In addition, it gives flexibility for users to choose desired identifiers.
The matching table for this example is included in spatialHeatmap and imported. The SVGBulk and dataBulk columns are aSVG features and bulk tissues respectively. Note, these two columns names are fixed so that the algorithm is able to recognize them.
[ThG: again users won't have this. The optimization belongs into the supplement.] [JZ: It is related to optimization.]
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
## SVGBulk dataBulk
## 1 cerebellum CERE
## 2 hippocampus HIPP
## 3 cerebral.cortex CERE.CORTEX
## 4 hippocampus HIPP
## 5 hypothalamus HYPOTHA
The processes of clustering single cells (Figure 1.4), refining cell clusters (Figure 1.5), and co-clustering bulk and single cells (Figure 1.7-8) are performed in a single function call. Setting return.all=TRUE returns results in a list (res.lis).
res.lis <- read_cache(cache.pa, 'res.lis')
if (is.null(res.lis)) {
res.lis <- coclus_meta(bulk=mus.lis.nor$bulk, cell=mus.lis.nor$sc.mus, df.match=df.match.mus.brain, return.all=TRUE, multi.core.par=MulticoreParam(workers=1, RNGseed=50), verbose=FALSE)
res.lis <- res.lis[[1]]
save_cache(dir=cache.pa, overwrite=TRUE, res.lis)
}
The source bulk tissue assignments are stored in the colData slot of sce.asg in the result list (res.lis), which are partialy shown below. The assignedBulk indicates assigned source bulk tissues for each cell, and the SVGBulk contains corresponding aSVG features, which is based on the matching table df.match.mus.brain. The predictor includes similarities (Spearman's correlation coefficient) between single cells and assigned bulk tissues.
colData(res.lis$sce.asg)[1:2, c('cell', 'assignedBulk', 'predictor', 'SVGBulk')]
## DataFrame with 2 rows and 4 columns
## cell assignedBulk predictor SVGBulk
## <character> <character> <character> <character>
## isocort isocort CERE.CORTEX 0.33 cerebral.cortex
## isocort isocort CERE.CORTEX 0.467 cerebral.cortex
The co-cluster information is stored in the colData slot of sce.bulk.cell in the result list (res.lis). The cluster and cell columns indicate co-cluster labels and bulk/cell identifiers respectively.
colData(res.lis$sce.bulk.cell)[1:3, ]
## DataFrame with 3 rows and 2 columns
## cluster cell
## <character> <character>
## CERE.CORTEX clus11 CERE.CORTEX
## HIPP clus20 HIPP
## HYPOTHA clus7 HYPOTHA
Each co-cluster consisting of bulk and cells can be visualized in an embedding plot. The following plot is the visualization of co-cluster clus11, where other bulk and cells are in gray.
plot_dim(res.lis$sce.bulk.cell, dim='PCA', color.by='cluster', group.sel='clus11')
Figure 7: Embedding plot of co-clusters
Dots represent bulk tissues or single cells. The bulk and cells in the target co-cluster are in blue while all other bulk and cells are in gray.
[ThG: users would expect to see a plot next, yet a new subsection starts now that lacks context. Is this an oversight?]
The automatic (co-clustering) method is optimized and tested on a small number of real bulk and single cell datasets, due to the limited availability of single cell datasets where single cells are assayed on a whole organ or whole large tissue and each single cell is well annotated. Thus, the accuracy of automatic source bulk tissue assignments might be also limited. As a remedy, utilities are developed for tailoring the automatic bulk tissue assignments at user preferences. Note, the tailoring step is optional, if users are satisfied with the bulk assignments, just proceed to the co-visualization step.
The tailoring can be performed in command line or on a Shiny app. This section illustrates the command-line based tailoring. First visualize single cells after single-cell cluster refining (Figure 1.5) in an embedding plot as shown below. The x-y axis breaks (x.break, y.break) can be tuned so as to define accurate coordinates in the next step.
plot_dim(res.lis$cell.refined, dim='PCA', color.by='cell', x.break=seq(-10, 10, 2), y.break=seq(-10, 10, 2))
Figure 8: PCA embedding plot of mouse brain single cell data
Single cells right after cluster refining are plotted.
Define desired bulk tissues (desiredSVGBulk) for cells selected by x-y coordinate ranges (x.min, x.max, y.min, y.max) in the embedding plot in form of a data.frame (df.desired.bulk). The dimred reveals where the coordinates come from and are required. In this example, cerebral.cortex is the desired bulk tissue for cells in the two selected regions.
df.desired.bulk <- data.frame(x.min=c(-6, -2), x.max=c(-4, 0), y.min=c(6, 4), y.max=c(8, 6), desiredSVGBulk=c('cerebral.cortex', 'cerebral.cortex'), dimred='PCA')
df.desired.bulk
## x.min x.max y.min y.max desiredSVGBulk dimred
## 1 -6 -4 6 8 cerebral.cortex PCA
## 2 -2 0 4 6 cerebral.cortex PCA
Incorporate desired bulk assignments to co-clustering results by calling refine_asg. The predictors corresponding to desired bulk are internally set at the maximum of 1. The thr argument is a predictor threshold and used to filter bulk assignments.
[ThG: lack of context; hard to follow; does this need to be here?]
res.lis <- refine_asg(res.lis=res.lis, thr=0, df.desired.bulk=df.desired.bulk, df.match=df.match.mus.brain)
Similar with the annotation-based and manual methods, cells under the same SVG feature (bulk tissue) are aggregated by averaging (aggr='mean') normalized assay profiles (assay.na='logcounts'). The aggregated assay profiles are used to color corresponding aSVG features.
[ThG: I am unable follow. Please provide clearer explantions to help readers what is done here or remove this part. The fact that you are not showing any output or visualization in this and other code sections makes it very hard to understand what is happening throughout.]
sce.aggr <- aggr_rep(data=res.lis$sce.asg, assay.na='logcounts', sam.factor='SVGBulk', con.factor=NULL, aggr='mean')
The co-visualization of bulk and single cells is built on aggregated gene abundance profiles (profile=TRUE) of gene Adcy1. Two target aSVG features (bulk tissues) are selected, i.e. cerebellum and cerebral.cortex. In the embedding plot, cells matching the same target aSVG feature in the anatomical image are colored according to their aggregated assay profiles, and the same color is used to fill their matching aSVG features. In the embedding plot, the two small colored clusters on the top are the cells defined in df.desired.bulk and are also counted when aggregating expression profiles.
shm.lis1 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=res.lis$cell.refined, dimred='PCA', assay.na='logcounts', tar.bulk=c('cerebellum', 'cerebral.cortex'), profile=TRUE, dim.lgd.text.size=10, dim.lgd.nrow=1, bar.width=0.1)
Figure 9: Co-visualizing bulk and single cells of mouse brain with abundance profiles
The aggregated expression profile of gene Adcy1 in cells matching the same bulk is used to fill these cells and bulk tissues.
Bulk tissues and single cells are co-visualized without abundance profiles (profile=FALSE). The aSVG features (bulk tissues) of interest are filled by different colors in the anatomical image, while the matching cells are indicated in the embedding plot with the same color as their source bulk tissues. At the top of embedding plot, the two small light green clusters are the selected cells in the tailoring section.
shm.lis2 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, legend.nrow=4, sce.dimred=res.lis$cell.refined, dimred='PCA', tar.bulk=c('cerebellum', 'cerebral.cortex'), profile=FALSE, dim.lgd.text.size=10, dim.lgd.nrow=1)
Figure 10: Co-visualizing bulk and single cells of mouse brain without abundance profiles
The matching between cells and source bulk tissues (aSVG features) is denoted by the same color between the embedding plot and anatomical image.
[ThG: this section would need to show a screenshot showing where this is used in the shiny app.]
The co-visualization feature is included in the integrated Shiny app that is an GUI implementation of spatialHeatmap, including annotation-based, maual, and automatic methods. To start this app, simply call shiny_shm() in R. Below is a screenshot of the co-visulization output generated by the automatic method.
Figure 11: Screenshot of the co-visualization output in Shiny app
The co-visualization plot is generated by the automatic method.
When using the Shiny app, single cell data in annotation-based and manual methods or combined single cell and bulk data in automatic method are stored in a SingleCellExperiment object and saved in an .rds file by saveRDS, then the .rds file is uploaded to the app. In the manual method, the manual clusters should be included in the cluster column in the colData slot by calling manual_cluster before creating an .rds file, as shown here. In the automatic method, bulk tissues and single cells are labeled by bulk and cell respectively in the bulkCell column in colData slot. The matching table between bulk tissues and aSVG features is stored in the metadata list with the name df.match. The example .rds file below illustrates these rules.
sce.auto <- readRDS(system.file("extdata/shinyApp/example", 'sce_auto_bulk_cell_mouse_brain.rds', package="spatialHeatmap"))
colData(sce.auto)
## DataFrame with 2219 rows and 1 column
## bulkCell
## <character>
## CERE.CORTEX bulk
## HIPP bulk
## HYPOTHA bulk
## CERE bulk
## CERE.CORTEX bulk
## ... ...
## midbrain cell
## midbrain cell
## midbrain cell
## hindbrain cell
## midbrain cell
metadata(sce.auto)$df.match
## SVGBulk dataBulk
## 1 cerebellum CERE
## 2 hippocampus HIPP
## 3 cerebral.cortex CERE.CORTEX
## 4 hippocampus HIPP
## 5 hypothalamus HYPOTHA
[ThG: I don't understand the above paragraph. The shiny part is totally out of context.]
This section describes tailoring co-clustering results on the convenience Shiny app, which is lauched by calling desired_bulk_shiny.
Figure 12 is the screenshot of the Shiny app. The file to upload is an .rds file of a SingleCellExperiment object saved by saveRDS. An example of how to generate such a file is seen in the help file of desired_bulk_shiny. On the left embedding plot, cells are selected with the "Lasso Select" tool. On the right, selected cells and their coordinates are listed in a table, and the desired bulk tissues (aSVG features) can be selected from the dropdown list, here cerebral.cortex. To download the table just click the "Download" button. The "Help" button gives more instructions.
Figure 12: Screenshot of the Shiny app for selecting desired bulk tissues
On the left is the embedding plot of single cells, where target cells are selected with the "Lasso Select" tool. On the right, desired bulk tissues are assigned for selected cell.
An example of desired bulk downloaded from the convenience Shiny app is shown below. The x-y coordinates refer to single cells in embbeding plots (dimred). The df.desired.bulk is ready to use in the tailoring section.
desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap")
df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t')
df.desired.bulk[1:3, ]
## x y key desiredSVGBulk dimred
## 1 4.389422 6.917510 62 cerebellum PCA
## 2 4.586877 8.077207 75 cerebellum PCA
## 3 6.366188 7.695782 76 cerebellum PCA
[JZ: The optimization section will go to an external html file. Its update is in progess.]
Since real optimizations have high demand on computing power and take a long time, it is demonstrated on toy data. Thus the result parameter settings may not be really optimal. The example bulk and single cell RNA-seq data are from Arabidopsis thaliana (Arabidopsis) root. Bulk tissue data comprise all the major root tissues such as epidermis, cortex, endodermis, xylem, columella, which are generated in a research on alternative splicing and lincRNA regulation (S. Li et al. 2016). The two single cell data sets are derived from the whole root, which are produced in a study of single cell Arabidopsis root atlas (Shahan et al. 2020). The identities of bulk and single cells are all labeled.
The optimization focuses on parameters of normalization methods, filtering, dimensionality reduction methods, refining homogeneous cell clusters, number of top dimensionalities in co-clustering, graph-building methods in co-clustering. The optimization is performed by running the co-clustering workflow (Figure 1) on each of the single cell data sets. The parameter settings being optimized is fixed and all settings of other parameters are varied across all possible combinations.
Each running of the workflow yields an AUC value, thus after running the workflow on all possible settings combinations one parameter settings has a set of AUC values. The AUCs are filtered according to some criteria and the remaining AUCs are averaged. A settings with a higher mean AUC than its counterparts are taken as optimal in a parameter. For example, when optimizing dimensionality reduction methods, the settings are PCA and UMAP. If the mean of remaining AUCs of PCA is 0.6 while UMAP is 0.55, PCA is regarded as the optimal.
Since optimzation on example data also takes a relatively long time, most of the following steps are not evaluated. A common computer with 4G memory and 4 CPUs is enough to run the following optimization process.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Read bulk and two single cell data.
blk <- readRDS(system.file("extdata/cocluster/data", "bulk_cocluster.rds", package="spatialHeatmap")) # Bulk.
sc10 <- readRDS(system.file("extdata/cocluster/data", "sc10_cocluster.rds", package="spatialHeatmap")) # Single cell.
sc11 <- readRDS(system.file("extdata/cocluster/data", "sc11_cocluster.rds", package="spatialHeatmap")) # Single cell.
blk; sc10; sc11
## class: SummarizedExperiment
## dim: 2805 45
## metadata(0):
## assays(1): counts
## rownames(2805): AT1G01070 AT1G01120 ... ATCG00650
## ATCG00770
## rowData names(0):
## colnames(45): PHLM_COMP PHLM_COMP ... HAIR CORT
## colData names(0):
## class: SingleCellExperiment
## dim: 577 1893
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870
## AT5G67080
## rowData names(0):
## colnames(1893): atricho endo ... atricho cortex
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## class: SingleCellExperiment
## dim: 577 2364
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870
## AT5G67080
## rowData names(0):
## colnames(2364): atricho tricho ... endo atricho
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
These example data are already pre-processed. To demonstrate the optimization process the pre-processing steps are perfomed again with few genes or cells removed.
Inital filtering with low strigency before normalization.
blk <- filter_data(data=blk, pOA=c(0.2, 15), CV=c(1.5, 100))
fil.init <- filter_cell(lis=list(sc10=sc10, sc11=sc11), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.3, p.in.gen=0.1); fil.init
Combine and normalize bulk and single cell data, then separate them. By default computeSumFactors (fct) in scran package is used (Lun, McCarthy, and Marioni 2016). If cpm=TRUE, additional normalization of counts per million is applied.
norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm
Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created. In bulk data, genes with expression values over A across samples of over proportion p and with coefficinet of variance (CV) between cv1 and cv2 are retained. In cell data, genes with expression values over min.cnt of at least proportion p.in.gen are retained, and cells with with expression values over min.cnt of at least proportion p.in.cell are retained.
df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
## 4 0.4 1 0.4 100 1 0.35 0.15
Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.
if (!dir.exists('opt_res')) dir.create('opt_res')
fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='fct')
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='cpm')
To evaluate the downstream auto-matching performance, a ground-truth matching relationship is required in form of data.frame. The cell and dataBulk refer to bulk tissue identifiers in aSVG files, single cell identifiers and bulk tissue identifiers in the data for co-clustering, respectively. If a cell matches multiple bulk tissues, bulk identifiers are separated by comma, semicolon, or single space such as NONHAIR,LRC_NONHAIR. The SVGBulk is the bulk identifiers in aSVG files, which are recognized in co-visualization.
match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
## SVGBulk cell dataBulk
## 1 NONHAIR atricho NONHAIR,LRC_NONHAIR
## 2 COLU colu.dist.colu COLU
## 3 COLU colu.dist.lrc COLU
In real application, the whole optimization takes a long time and requires a lot of computation power. For example, combined bulk and cell data with 6945 genes and 7747 samples requires about 20G memory for coclustering. To speed up computation, the optimization function coclus_opt provides two levels of parallel computing through BiocParallel (Morgan et al. 2021). The first one is BatchtoolsParam and relies on a cluster scheduler such as the SLURM scheduler and the second one utilizes MulticoreParam.
Before optimzation, users could check the parallelization guide by setting parallel.info=TRUE, then it returns the max possible parallelizations for each level respectively.
coclus_opt(wk.dir='opt_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1))
A SLURM template is provided as an example for the first level parallelization. Users are advised to make a new copy and set SLURM parameters in the new copy. If users have access to other cluster schedulers, the template should be provided accordingly.
file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')
Below is the demonstration of two-level parallelization on SLURM. For instance, the first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering.
sim and sim.p are parameters in refining cell clusters (Figure 1.5). Specifically, in a cell cluster, cells having similarities over sim with other cells in the same cluster of at least proportion sim.p would remain. sim is Spearman' or Pearson's correlation coefficient. dim is the number of top dimensionalities (equivalent to genes) in co-clustering. Since the three parameters are related to each other, they are treated as a set spd.set.
opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100, stop.on.error = FALSE, log =TRUE, logdir=file.path('opt_res', 'batch_log')), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)
If no cluster scheduler is available, optimization can be parallelized only at the second-level by setting batch.par=NULL.
opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))
The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.
Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.fil <- auc_stat(wk.dir='opt_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each filtering settings and AUC cutoff.
df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings', x.text.size=15, y.text.size=15, lgd.text.size=15)
Figure 13: Mean AUCs of filtering settings
One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.
All AUCs by each filtering settings and AUC cutoff.
auc_violin(df.lis=df.lis.fil, xlab='Filtering settings', x.text.size=13, y.text.size=13)
Figure 14: All AUCs of filtering settings
A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.
According to the mean AUCs, optimal filtering settings are fil1, fil2, fil3.
df.par.fil[c(1, 2, 3), ]
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.norm <- auc_stat(wk.dir='opt_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each normalization method and AUC cutoff.
df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods', x.text.size=15, y.text.size=15, lgd.text.size=15)
All AUCs by each normalization method and AUC cutoff.
auc_violin(df.lis=df.lis.norm, xlab='Normalization methods', x.text.size=13, y.text.size=13)
Optimal normalization method: fct (computeSumFactors).
Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.graph <- auc_stat(wk.dir='opt_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each graph-building method and AUC cutoff.
df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')
All AUCs by each graph-building method and AUC cutoff.
auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')
Optimal graph-building methods: knn (buildKNNGraph).
Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.dimred <- auc_stat(wk.dir='opt_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each dimensionality reduction method and AUC cutoff.
df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')
All AUCs by each dimensionality reduction method and AUC cutoff.
auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')
Optimal dimensionality reduction method: pca (denoisePCA).
Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.spd <- auc_stat(wk.dir='opt_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]
All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)
Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets.
n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
In real application, the optimized settings need to be validated on data sets from other organs of different species, which is presented below.
Ideally, the co-clustering should be optimized on different organs from different organisms as many possible. The single cell data need to be generated on whole organs and each cell's identity need to be labeled. Such data are less common and not easy to obtain in public databases, since most single cell RNA-seq (scRNA-seq) assays only focus on specific cell populations rather than whole organs, which are isolated by microdissection or fluorescent assisted cell sorting (FACS). As a result, the co-clustering optimization is performed only on five single cell data sets of Arabidopsis thaliana (Arabidopsis) root. The optimized parameter settings are validated on mouse brain and kidney.
The optimization in real case has high demand on computing power and takes a long time, so most of the following steps are not evaluated. The following steps are not explained in details since they are the same as last section.
The bulk (S. Li et al. 2016) and five single cell (Shahan et al. 2020) data sets of Arabidopsis root are accessed from the same studies as last section. Details about how to access and format them are described here. In the following, blk.arb.rt refers to bulk data and sc.arab.rt10, sc.arab.rt11, sc.arab.rt12, sc.arab.rt30, sc.arab.rt31 refers to the five single cell data sets respectively.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Inital filtering with low strigency before normalization.
blk.arab.rt <- filter_data(data=blk.arab.rt, pOA=c(0.05, 5), CV=c(0.05, 100))
fil.init <- filter_cell(lis=list(sc10=sc.arab.rt10, sc11=sc.arab.rt11, sc12=sc.arab.rt12, sc30=sc.arab.rt30, sc31=sc.arab.rt31), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.01, p.in.gen=0.05); fil.init
Combine and normalize bulk and single cell data, then separate them.
norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm
Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created.
df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
## 4 0.4 1 0.4 100 1 0.35 0.15
Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.
if (!dir.exists('opt_real_res')) dir.create('opt_real_res')
fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11, sc12=norm.fct$sc12, sc30=norm.fct$sc30, sc31=norm.fct$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='fct')
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11, sc12=norm.cpm$sc12, sc30=norm.cpm$sc30, sc31=norm.cpm$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='cpm')
Ground-truth matching relationship across cell, dataBulk, and SVGBulk.
match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
## SVGBulk cell dataBulk
## 1 NONHAIR atricho NONHAIR,LRC_NONHAIR
## 2 COLU colu.dist.colu COLU
## 3 COLU colu.dist.lrc COLU
The max possible parallelizations for each level respectively.
coclus_opt(wk.dir='opt_real_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1))
Take the SLURM scheduler as an example for two-level parallelizatio. Make a new copy of the default SLURM template and set parameters in the new copy.
file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')
The first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering. Note the settings of spd.set (sim/sim.p/dim) has wider ranges than in last section. The parallel computation is performed at High-Performance Computing Center (HPCC) at University of California, Riverside.
opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100, stop.on.error=FALSE, log=TRUE, logdir=file.path('opt_res', 'batch_log')), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)
If no cluster scheduler is not available, optimization can be parallelized only at the second-level by setting batch.par=NULL.
opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))
The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.
Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.fil <- auc_stat(wk.dir='opt_real_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each filtering settings and AUC cutoff.
df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings')
Figure 15: Mean AUCs of filtering settings in real optimization
A bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.
All AUCs by each filtering settings and AUC cutoff.
auc_violin(df.lis=df.lis.fil, xlab='Filtering settings')
Figure 16: All AUCs of filtering settings in real optimization
A violin plot refers to all AUCs of a filtering settings at a AUC cutoff.
According to the mean AUCs, fil1, fil2, and fil3 are selected as optimal filtering settings.
df.par.fil[c(1, 2, 3), ]
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.norm <- auc_stat(wk.dir='opt_real_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each normalization method and AUC cutoff.
df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods')
All AUCs by each normalization method and AUC cutoff.
auc_violin(df.lis=df.lis.norm, xlab='Normalization methods')
Optimal normalization method: fct (computeSumFactors).
Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.graph <- auc_stat(wk.dir='opt_real_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each graph-building method and AUC cutoff.
df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')
All AUCs by each graph-building method and AUC cutoff.
auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')
Since knn (buildKNNGraph) and snn (buildSNNGraph) have similar mean AUCs, both are selected as optimal graph-building methods.
Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.dimred <- auc_stat(wk.dir='opt_real_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each dimensionality reduction method and AUC cutoff.
df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')
All AUCs by each dimensionality reduction method and AUC cutoff.
auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')
Optimal dimensionality reduction method: pca (denoisePCA).
Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.spd <- auc_stat(wk.dir='opt_real_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]
All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)
Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets. s, p, d stands for sim, sim.p, dim respectively. E.g. s0.2p0.5d12 means sim = 0.2, sim.p = 0.5, dim = 12.
n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
## sim sim.p dim spd.set
## 1 0.2 0.2 5 s0.2p0.2d5
## 22 0.2 0.5 5 s0.2p0.5d5
## 57 0.2 0.3 6 s0.2p0.3d6
The optimal parameter settings at this stage are listed in the table below.
| normalization | filtering.set | dimensionality.reduction | graph.building | spd.set |
|---|---|---|---|---|
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.6d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.8d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.8d10 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.5d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d14 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.2d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d23 |
Next, these optimal settings are validated on mouse brain, mouse kindney, and Arabidopsis root data sets. In mouse brain, the bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). The bulk count data are produced using systemPipeR (2.1.12) (Backman and Girke 2016). Details about how to access and format bulk and single data are described here. In the following, blk.mus.brain and sc.mus.brain refers to bulk and single cell data respectively. The validation is performed by applying these optimal settings on the same coclustering workflow, so the following procedures are not detailed.
Initial filtering.
blk.mus.brain <- filter_data(data=blk.mus.brain, pOA=c(0.05, 5), CV=c(0.05, 100))
mus.brain.lis <- filter_cell(lis=list(sc.mus=sc.mus.brain), bulk=blk.mus.brain, gen.rm=NULL, min.cnt=1, p.in.cell=0.01, p.in.gen=0.05, verbose=FALSE)
Bulk and single cell are combined and normalized, then separated.
mus.brain.lis.nor <- norm_multi(dat.lis=mus.brain.lis, cpm=FALSE)
Secondary filtering. Since fil1 and fil2 exhibit similar performaces, only fil1 is used.
blk.mus.brain.fil <- filter_data(data=mus.brain.lis.nor$bulk, pOA=c(0.1, 1), CV=c(0.1, 100), verbose=FALSE)
mus.brain.lis.fil <- filter_cell(lis=list(sc.mus=mus.brain.lis.nor$sc.mus), bulk=blk.mus.brain.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE)
Matching table indicating true bulk tissues of each cell type and corresponding SVG bulk (spatial feature).
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
## SVGBulk dataBulk
## 1 cerebellum CERE
## 2 hippocampus HIPP
## 3 cerebral.cortex CERE.CORTEX
## 4 hippocampus HIPP
## 5 hypothalamus HYPOTHA
Since knn and snn display similar performances, only knn is used. All optimal spd.set settings in Table 1 are tested, and results are shown in Figure 17a.
mus.brain.df.para <- coclus_meta(bulk=mus.brain.lis.fil$bulk, cell=mus.brain.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.spd.opt[, c('sim', 'sim.p', 'dim')], graph.meth='knn', dimred='PCA', return.all=FALSE, multi.core.par=MulticoreParam(workers=2))
In mouse kidney, four bulk tissues are selected: proximal straight tubule in cortical medullary rays (PTS2), cortical collecting duct (CCD), and cortical thick ascending limb of the loop of Henle (cTAL), glomerulus. PTS2 data are from a research on cell-type selective markers in mouse kidney (Clark et al. 2019), CCD and cTAL are from transcriptome analysis of major renal collecting duct cell types in mouse kidney (Chen et al. 2017), and glomerulus is from a transcriptome atlas study of mouse glomerulus (Karaiskos et al. 2018). The FASTQ files of the four tissues are downloaded from original studies and raw count data are generated with systemPipeR (2.1.12) (Backman and Girke 2016). The single cell data are accessed from an investigation in cellular targets of mouse kidney metabolic acidosis (Park et al. 2018). Details about how to access and format bulk and single data are described here.
The validating procedures on mouse kindey are same with mouse brain except that after initial filtering replicates in each bulk are reduced to 3 by using function reduce_rep due to two many replicates. The results are shown in Figure 17b.
In Arabidopsis root, the same bulk tissues (S. Li et al. 2016) and two additional single cell data sets (sc9, sc51, (Shahan et al. 2020)) from the same studies as real optimization are used. Details about how to access and format them are described here. The procedures of validating optimized settings are the same with mouse brain except that in normalization two single cell data sets are used instead of one. The results are shown in Figure 17c and d.
As comparisons, random combinations of non-optimal settings are generated and tested. In filtering, fil4 is regarded non-optimal, but it filters out too many genes so that the coclustering procedures cannot run. Thus fil3 is used in the random settings. The graph-building methods have two settings knn and snn, and both are taken as optimal, thus they are all used for generating random combinations. See details here.
df.par.rdn <- random_para(fil.set=c('fil3'), norm='cpm', dimred='UMAP', graph.meth=c('knn', 'snn'), sim=round(seq(0.2, 0.8, by=0.1), 1), sim.p=round(seq(0.2, 0.8,by=0.1), 1), dim=seq(5, 40, by=1), df.spd.opt=df.spd.opt)
df.par.rdn[1:3, ]
These random settings are tested on each of the four validating data sets, where other settings such as initial filtering are not changed. The results are shown in Figure 18c and d.
The AUCs of optimal and random settings are presented in Figure 17 and Figure 18 respectively. In both figures, if total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. It is clear that the optimal settings exhibit better performance than random settings, so the optimization workflow and results are reliable to some extent. In Figure 17, asterisks indicate optimal settings have AUCs >= 0.5, total bulk assignments >= 500, and total true assignments >= 300 across all four data sets. These settings are regarded as final optimal settings (Table 2).
Figure 17: Validating optimal settings
AUCs of optimal settings on each validating data sets. A bar refers to one AUC of one optimal settings. Asterisks indicate optimal settings have AUC >= 0.5, total bulk assignments >= 500, and total true assignments >= 300 across all four data sets. If total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.
Figure 18: Random settings
AUCs of random settings on each validating data sets. A bar refers to one AUC of one random settings. If total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.
| normalization | filtering.set | dimensionality.reduction | graph.building | spd.set |
|---|---|---|---|---|
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.6d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.8d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.5d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d14 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.2d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d23 |
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets
## [7] methods base
##
## other attached packages:
## [1] spatialHeatmap_2.1.3 scRNAseq_2.10.0
## [3] knitr_1.39 BiocParallel_1.30.0
## [5] igraph_1.3.1 scater_1.24.0
## [7] ggplot2_3.3.6 scran_1.24.0
## [9] scuttle_1.6.0 SingleCellExperiment_1.18.0
## [11] SummarizedExperiment_1.26.1 Biobase_2.56.0
## [13] GenomicRanges_1.48.0 GenomeInfoDb_1.32.1
## [15] IRanges_2.30.0 S4Vectors_0.34.0
## [17] BiocGenerics_0.42.0 MatrixGenerics_1.8.0
## [19] matrixStats_0.62.0 BiocStyle_2.24.0
## [21] nvimcom_0.9-25
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 rtracklayer_1.56.0
## [3] tidyr_1.2.0 bit64_4.0.5
## [5] irlba_2.3.5 DelayedArray_0.22.0
## [7] data.table_1.14.2 rpart_4.1.16
## [9] AnnotationFilter_1.20.0 KEGGREST_1.36.0
## [11] RCurl_1.98-1.6 doParallel_1.0.17
## [13] generics_0.1.2 GenomicFeatures_1.48.0
## [15] preprocessCore_1.58.0 ScaledMatrix_1.4.0
## [17] cowplot_1.1.1 RSQLite_2.2.13
## [19] bit_4.0.4 xml2_1.3.3
## [21] httpuv_1.6.5 assertthat_0.2.1
## [23] viridis_0.6.2 xfun_0.30
## [25] hms_1.1.1 jquerylib_0.1.4
## [27] evaluate_0.15 promises_1.2.0.1
## [29] fansi_1.0.3 restfulr_0.0.13
## [31] progress_1.2.2 caTools_1.18.2
## [33] dbplyr_2.1.1 DBI_1.1.2
## [35] geneplotter_1.74.0 htmlwidgets_1.5.4
## [37] purrr_0.3.4 ellipsis_0.3.2
## [39] RSpectra_0.16-1 dplyr_1.0.9
## [41] backports_1.4.1 bookdown_0.26
## [43] annotate_1.74.0 biomaRt_2.52.0
## [45] sparseMatrixStats_1.8.0 vctrs_0.4.1
## [47] ensembldb_2.20.1 cachem_1.0.6
## [49] withr_2.5.0 checkmate_2.1.0
## [51] GenomicAlignments_1.32.0 prettyunits_1.1.1
## [53] cluster_2.1.3 ExperimentHub_2.4.0
## [55] grImport_0.9-5 lazyeval_0.2.2
## [57] crayon_1.5.1 genefilter_1.78.0
## [59] labeling_0.4.2 edgeR_3.38.0
## [61] pkgconfig_2.0.3 ProtGenerics_1.28.0
## [63] vipor_0.4.5 nnet_7.3-17
## [65] rlang_1.0.2 lifecycle_1.0.1
## [67] filelock_1.0.2 BiocFileCache_2.4.0
## [69] rsvd_1.0.5 AnnotationHub_3.4.0
## [71] av_0.7.0 rsvg_2.3.1
## [73] rngtools_1.5.2 Matrix_1.4-1
## [75] Rhdf5lib_1.18.0 base64enc_0.1-3
## [77] beeswarm_0.4.0 png_0.1-7
## [79] viridisLite_0.4.0 rjson_0.2.21
## [81] bitops_1.0-7 shinydashboard_0.7.2
## [83] KernSmooth_2.23-20 visNetwork_2.1.0
## [85] rhdf5filters_1.8.0 pROC_1.18.0
## [87] Biostrings_2.64.0 blob_1.2.3
## [89] DelayedMatrixStats_1.18.0 doRNG_1.8.2
## [91] stringr_1.4.0 jpeg_0.1-9
## [93] gridGraphics_0.5-1 rols_2.24.0
## [95] beachmat_2.12.0 scales_1.2.0
## [97] memoise_2.0.1 magrittr_2.0.3
## [99] plyr_1.8.7 gplots_3.1.3
## [101] distinct_1.8.0 zlibbioc_1.42.0
## [103] compiler_4.2.0 dqrng_0.3.0
## [105] BiocIO_1.6.0 RColorBrewer_1.1-3
## [107] DESeq2_1.36.0 Rsamtools_2.12.0
## [109] cli_3.3.0 XVector_0.36.0
## [111] htmlTable_2.4.0 Formula_1.2-4
## [113] MASS_7.3-57 WGCNA_1.71
## [115] tidyselect_1.1.2 stringi_1.7.6
## [117] highr_0.9 yaml_2.3.5
## [119] BiocSingular_1.12.0 locfit_1.5-9.5
## [121] latticeExtra_0.6-29 ggrepel_0.9.1
## [123] grid_4.2.0 sass_0.4.1
## [125] tools_4.2.0 parallel_4.2.0
## [127] rstudioapi_0.13 bluster_1.6.0
## [129] foreach_1.5.2 foreign_0.8-82
## [131] metapod_1.4.0 gridExtra_2.3
## [133] Rtsne_0.16 farver_2.1.0
## [135] digest_0.6.29 BiocManager_1.30.17
## [137] FNN_1.1.3 shiny_1.7.2
## [139] Rcpp_1.0.8.3 BiocVersion_3.15.2
## [141] later_1.3.0 RcppAnnoy_0.0.19
## [143] httr_1.4.3 ggdendro_0.1.23
## [145] AnnotationDbi_1.58.0 colorspace_2.0-3
## [147] XML_3.99-0.9 splines_4.2.0
## [149] uwot_0.1.11 yulab.utils_0.0.4
## [151] statmod_1.4.36 ggplotify_0.1.0
## [153] plotly_4.10.0 xtable_1.8-4
## [155] jsonlite_1.8.0 dynamicTreeCut_1.63-1
## [157] UpSetR_1.4.0 flashClust_1.01-2
## [159] R6_2.5.1 Hmisc_4.7-0
## [161] pillar_1.7.0 htmltools_0.5.2
## [163] mime_0.12 glue_1.6.2
## [165] fastmap_1.1.0 BiocNeighbors_1.14.0
## [167] interactiveDisplayBase_1.34.0 codetools_0.2-18
## [169] utf8_1.2.2 lattice_0.20-45
## [171] bslib_0.3.1 tibble_3.1.7
## [173] curl_4.3.2 ggbeeswarm_0.6.0
## [175] gtools_3.9.2 magick_2.7.3
## [177] GO.db_3.15.0 survival_3.3-1
## [179] limma_3.52.0 rmarkdown_2.14
## [181] munsell_0.5.0 fastcluster_1.2.3
## [183] rhdf5_2.40.0 GenomeInfoDbData_1.2.8
## [185] iterators_1.0.14 HDF5Array_1.24.0
## [187] impute_1.70.0 reshape2_1.4.4
## [189] gtable_0.3.0
This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.
Backman, Tyler W. H., and Thomas Girke. 2016. “SystemPipeR: NGS Workflow and Report Generation Environment.” BMC Bioinformatics 17 (1). Springer Science; Business Media LLC. doi:10.1186/s12859-016-1241-0.
Chang, Winston, and Barbara Borges Ribeiro. 2018. Shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.
Chang, Winston, Joe Cheng, JJ Allaire, Cars on Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Chen, Lihe, Jae Wook Lee, Chung-Lin Chou, Anil V Nair, Maria A Battistone, Teodor G Păunescu, Maria Merkulova, et al. 2017. “Transcriptomes of Major Renal Collecting Duct Cell Types in Mouse Identified by Single-Cell RNA-seq.” Proc. Natl. Acad. Sci. U. S. A. 114 (46): E9989–E9998.
Clark, Jevin Z, Lihe Chen, Chung-Lin Chou, Hyun Jun Jung, Jae Wook Lee, and Mark A Knepper. 2019. “Representation and Relative Abundance of Cell-Type Selective Markers in Whole-Kidney RNA-Seq Data.” Kidney Int. 95 (4): 787–96.
Gentleman, R, V Carey, W Huber, and F Hahne. 2018. “Genefilter: Methods for Filtering Genes from High-Throughput Experiments.” http://bioconductor.uib.no/2.7/bioc/html/genefilter.html.
Karaiskos, Nikos, Mahdieh Rahmatollahi, Anastasiya Boltengagen, Haiyue Liu, Martin Hoehne, Markus Rinschen, Bernhard Schermer, et al. 2018. “A Single-Cell Transcriptome Atlas of the Mouse Glomerulus.” J. Am. Soc. Nephrol. 29 (8): 2060–8.
Li, Song, Masashi Yamada, Xinwei Han, Uwe Ohler, and Philip N Benfey. 2016. “High-Resolution Expression Map of the Arabidopsis Root Reveals Alternative Splicing and lincRNA Regulation.” Dev. Cell 39 (4): 508–22.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Res. 5: 2122. doi:10.12688/f1000research.9501.2.
Marques, Sueli, Amit Zeisel, Simone Codeluppi, David van Bruggen, Ana Mendanha Falcão, Lin Xiao, Huiliang Li, et al. 2016. “Oligodendrocyte Heterogeneity in the Mouse Juvenile and Adult Central Nervous System.” Science 352 (6291): 1326–9.
McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Willis. 2017. “Scater: Pre-Processing, Quality Control, Normalisation and Visualisation of Single-Cell RNA-Seq Data in R.” Bioinformatics 33 (8): 1179–86. doi:10.1093/bioinformatics/btw777.
Morgan, Martin, Jiefei Wang, Valerie Obenchain, Michel Lang, Ryan Thompson, and Nitesh Turaga. 2021. BiocParallel: Bioconductor Facilities for Parallel Evaluation. https://github.com/Bioconductor/BiocParallel.
Ortiz, Cantin, Jose Fernandez Navarro, Aleksandra Jurek, Antje Märtin, Joakim Lundeberg, and Konstantinos Meletis. 2020. “Molecular Atlas of the Adult Mouse Brain.” Science Advances 6 (26). American Association for the Advancement of Science: eabb3446.
Park, Jihwan, Rojesh Shrestha, Chengxiang Qiu, Ayano Kondo, Shizheng Huang, Max Werth, Mingyao Li, Jonathan Barasch, and Katalin Suszták. 2018. “Single-Cell Transcriptomics of the Mouse Kidney Reveals Potential Cellular Targets of Kidney Disease.” Science 360 (6390): 758–63.
Risso, Davide, and Michael Cole. 2022. ScRNAseq: Collection of Public Single-Cell RNA-Seq Datasets.
Shahan, Rachel, Che-Wei Hsu, Trevor M Nolan, Benjamin J Cole, Isaiah W Taylor, Anna Hendrika Cornelia Vlot, Philip N Benfey, and Uwe Ohler. 2020. “A Single Cell Arabidopsis Root Atlas Reveals Developmental Trajectories in Wild Type and Cell Identity Mutants.” BioRxiv.
Vacher, Claire-Marie, Helene Lacaille, Jiaqi J O’Reilly, Jacquelyn Salzbank, Dana Bakalar, Sonia Sebaoui, Philippe Liere, et al. 2021. “Placental Endocrine Function Shapes Cerebellar Development and Social Behavior.” Nat. Neurosci. 24 (10). Nature Publishing Group: 1392–1401.